<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>

<head>
	<title>Configuring Xymon Alerts</title>
</head>

<body>
<h1>Configuring Xymon Alerts</h1>
<p>When something breaks, you want to know about it. Since you probably 
don't have the Xymon webpages in view all of the time, Xymon can 
generate alerts to draw your attention to problems. Alerts can go out
as e-mail, or Xymon can run a script that takes care of activating
a pager, sending an SMS, or however you prefer to get alerted.</p>
<ul>
	<li><a href="#simple">A simple alert configuration</a></li>
	<li><a href="#keywords">Configuration file keywords</a></li>
	<li><a href="#wildcards">Using regular expressions for names</a></li>
	<li><a href="#scripts">Alerting via a script</a></li>
	<li><a href="#macros">Using macros</a></li>
	<li><a href="#ignorerules">There are rules ... and exceptions: IGNORE</a></li>
</ul>

<h3><a name="simple">A simple alert configuration</a></h3>
<p>The configuration file for the Xymon alert module is <em>~/server/etc/alerts.cfg</em>.
This file consists of a number of <em>rules</em> that are matched against
the name of the host that has a problem, the name of the service, the
time of day and a number of other criteria. Each rule then has a number
of <em>recipients</em> that receive the alert. For each recipient you can
further refine the rules that need to be matched. An example:</p>
<pre>
	HOST=www.foo.com
		MAIL webmaster@foo.com SERVICE=http REPEAT=1h
		MAIL unixsupport@foo.com SERVICE=cpu,disk,memory
</pre>

<p>The first line defines a <em>rule</em> for alerting when something breaks on the host 
"www.foo.com".<br>
There are two recipients: <tt>webmaster@foo.com</tt> is notified if it is the "http"
service that fails, and the notification is repeated once an hour until the problem
is resolved.<br>
<tt>unixsupport@foo.com</tt> is notified if it is the "cpu", "disk" or "memory" 
tests that report a failure. Since there is no "REPEAT" setting for this recipient, 
the default is used which is to repeat the alert every 30 minutes.</p>

<p>OK, suppose now that the webmaster complains about getting e-mails at 4 AM.
The webserver is not supposed to be running between 9 PM and 8 AM, so even though
there is a problem, he doesn't want to hear about it until 7:30 - that gives him just 
enough time to fix the problem.  So you must modify the rule so that it doesn't send out 
alerts until 7:30 AM:</p>
<pre>
	HOST=www.foo.com
		MAIL webmaster@foo.com SERVICE=http REPEAT=1h TIME=*:0730:2100
		MAIL unixsupport@foo.com SERVICE=cpu,disk,memory
</pre>
<p>Adding the <em>TIME</em> setting on the recipient causes the alerts <i>for this recipient</i>
to be suppressed, unless the time of day is within the interval. So with this setup, the
webmaster gets his sleep.</p>

<p>What would have happened if you put the <em>TIME</em> setting on the <i>rule</i> instead
of on the <i>recipient</i>? Like this:</p>
<pre>
	HOST=www.foo.com TIME=*:0730:2100
		MAIL webmaster@foo.com SERVICE=http REPEAT=1h
		MAIL unixsupport@foo.com SERVICE=cpu,disk,memory
</pre>
<p>Well, the webmaster would still have his nights to himself - but the TIME setting would then
also apply to the alerts that go out when there is a problem with the "cpu", "disk" or "memory"
services. So there would not be any mails going to <tt>unixsupport@foo.com</tt> when a disk
fills up during the night.</p>

<h3><a name="keywords">Keywords in rules and recipients</a></h3>
<p>These are the keywords for setting up rules:</p>
<table width="80%" align="center" summary="alerts.cfg keywords">
	<tr><th align="left" valign="top">PAGE</th><td>rule matching an alert by the name of the page the host is displayed on. This is the name following the "page", "subpage" or "subparent" keyword in the hosts.cfg file.</td></tr>
	<tr><th align="left" valign="top">EXPAGE</th><td>rule excluding an alert if the pagename matches.</td></tr>
	<tr><th align="left" valign="top">HOST</th><td>rule matching an alert by the hostname.</td></tr>
	<tr><th align="left" valign="top">EXHOST</th><td>rule excluding an alert by matching the hostname.</td></tr>
	<tr><th align="left" valign="top">SERVICE</th><td>rule matching an alert by the service name.</td></tr>
	<tr><th align="left" valign="top">EXSERVICE</th><td>rule excluding an alert by matching the service name.</td></tr>
	<tr><th align="left" valign="top">COLOR</th><td>rule matching an alert by color. Can be "red", "yellow", or "purple".</td></tr>
	<tr><th align="left" valign="top">TIME</th><td>rule matching an alert by the time-of-day. The time is written in the same format as the DOWNTIME time specification in the hosts.cfg file (see hosts.cfg(5)), e.g. <em>*:0730:2100</em>.</td></tr>
	<tr><th align="left" valign="top">DURATION</th><td>Rule matching an alert if the event has lasted longer/shorter than the given duration. E.g. <em>DURATION&gt;10m</em> (lasted longer than 10 minutes) or <em>DURATION&lt;2h</em> (only sends alerts the first 2 hours). Unless explicitly stated, this is in minutes - you can use 'm', 'h', 'd' for 'minutes', 'hours' and 'days' respectively.</td></tr>
	<tr><th align="left" valign="top">UNMATCHED</th><td>This keyword on a recipient means that the recipient only gets an alert if no other alerts have been sent. For example, after setting up alerts to specific people for some services, you can add a recipient with the UNMATCHED keyword who only gets the alerts that were not sent to anyone else. You can also use it to set up a "catch-all" alert recipient: put the UNMATCHED keyword on a recipient at the end of the alerts.cfg file.</td></tr>
	<tr><th align="left" valign="top">RECOVERED</th><td>Rule matches if the alert has recovered from an alert state.</td></tr>
	<tr><th align="left" valign="top">NOTICE</th><td>Rule matches if the message is a "notify" message. This type of message is sent when a host or test is disabled or enabled.</td></tr>
</table>
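<p>As a hedged illustration of how several rule keywords combine on one line, the
sketch below only matches alerts that are red and have lasted more than 10 minutes,
on any host except one. The hostname <tt>test.foo.com</tt> and the address
<tt>oncall@foo.com</tt> are made up for this example:</p>
<pre>
	HOST=* EXHOST=test.foo.com COLOR=red DURATION&gt;10m
		MAIL oncall@foo.com
</pre>
<p>All criteria on the rule line have to match before its recipients are considered, so
a yellow status, or a red status that clears within 10 minutes, should not reach this
recipient.</p>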
<p>These are the keywords for specifying a recipient:</p>
<table width="80%" align="center" summary="alerts.cfg keywords">
	<tr><th align="left" valign="top">MAIL</th><td>Recipient who receives an e-mail alert. This takes one parameter, the e-mail address.</td></tr>
	<tr><th align="left" valign="top">SCRIPT</th><td>Recipient that invokes a script. This takes two parameters: The script filename, and the recipient that gets passed to the script.</td></tr>
	<tr><th align="left" valign="top">IGNORE</th><td>Recipient that does NOT send an alert, and will cause Xymon to stop looking for any more recipients. See the example below.</td></tr>
	<tr><th align="left" valign="top">FORMAT</th><td>format of the text message with the alert. Default is "TEXT" (suitable for e-mail alerts). "PLAIN" is the same as TEXT, except it does not include the URL linking to the status webpage. "SMS" is a short message with no subject for SMS alerts. "SCRIPT" is a brief message template for scripts.</td></tr>
	<tr><th align="left" valign="top">REPEAT</th><td>How often an alert gets repeated. As with the DURATION setting, this is in minutes unless explicitly modified with 'm', 'h', 'd'.</td></tr>
	<tr><th align="left" valign="top">STOP</th><td>By default, xymond_alert looks at all the possible recipients in the alerts.cfg file when handling an alert. If you would like it to stop after a specific recipient gets an alert, add the STOP keyword to this recipient. This terminates the search for more recipients.</td></tr>
</table>
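<p>The recipient keywords can be combined in the same way. The following sketch is one
possible layout, not a recommended setup; the address <tt>helpdesk@foo.com</tt> is
made up for the example:</p>
<pre>
	HOST=www.foo.com
		MAIL webmaster@foo.com SERVICE=http REPEAT=2h STOP
		MAIL unixsupport@foo.com

	HOST=*
		MAIL helpdesk@foo.com UNMATCHED
</pre>
<p>Read with the descriptions above: an "http" alert on www.foo.com goes to the webmaster
every two hours, and STOP ends the search there, so nobody else is notified. Other alerts
on www.foo.com skip the first recipient and go to <tt>unixsupport@foo.com</tt>. Alerts
from other hosts match only the last rule, and reach <tt>helpdesk@foo.com</tt> because
no one else received them.</p>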

<h3><a name="wildcards">Wildcards - regular expressions</a></h3>
<p>So now we can set up an alert. But using explicit hostnames is bothersome if you have many
hosts. There is a smarter way:</p>
<pre>
	HOST=%(www|intranet|support|mail).foo.com
		MAIL webmaster@foo.com SERVICE=http REPEAT=1h
		MAIL unixsupport@foo.com SERVICE=cpu,disk,memory
</pre>
<p>The percent-sign indicates that the hostname should not be taken literally - instead,
<tt>(www|intranet|support|mail).foo.com</tt> is a <i>Perl-compatible regular expression</i>.
This particular expression matches "www.foo.com", "intranet.foo.com", "support.foo.com" and
"mail.foo.com". You can use regular expressions to match hostnames, service-names and page-names.</p>
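<p>Because the text after the percent-sign is a regular expression, an unescaped dot matches
any character, not just a literal ".". A stricter variant of the rule above might look like
the sketch below. It assumes that the expression is handed to the regular expression library
unchanged, so that the usual Perl-compatible syntax for an anchor (<tt>^</tt>) and an escaped
dot (<tt>\.</tt>) applies; it is worth checking such a rule with the test mode of
xymond_alert before relying on it:</p>
<pre>
	HOST=%^(www|intranet|support|mail)\.foo\.com SERVICE=%^(cpu|disk|memory)
		MAIL unixsupport@foo.com
</pre>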

<p>If you want to test how your alert configuration handles a specific host, you can run xymond_alert in <b>test</b> mode - you give it a hostname and servicename as input, and it will go through the configuration and tell you which rules match and who gets an alert.</p>
<pre><tt>
	osiris:~ $ cd server/
	osiris:~/server $ ./bin/xymoncmd xymond_alert --test osiris.hswn.dk cpu
	Matching host:service:page 'osiris.hswn.dk:cpu:' against rule line 109:Matched
	    *** Match with 'HOST=*' ***
	Matching host:service:page 'osiris.hswn.dk:cpu:' against rule line 110:Matched
	    *** Match with 'MAIL henrik@sample.com REPEAT=2 RECOVERED COLOR=red' ***
	Mail alert with command 'mail -s "XYmon [12345] osiris.hswn.dk:cpu is RED" henrik@sample.com'
</tt></pre>

<h3><a name="scripts">If e-mail is not enough</a></h3>
<p>The <em>MAIL</em> keyword means that the alert is sent in an e-mail. Sometimes this ends
up being an SMS to your cell-phone - there are several "e-mail to SMS" gateways that perform
this service - but that may not be what you want to do. Also, an e-mail can only be
delivered if the mail-server is working. So if you need full control over how
alerts are handled, you can use the <em>SCRIPT</em> method instead. Here's how:</p>
<pre>
	HOST=%(www|intranet|support|mail).foo.com SERVICE=http
		SCRIPT /usr/local/bin/smsalert 4538761925 FORMAT=sms
</pre>
<p>This alert doesn't go out as e-mail. Instead, when an alert needs to be delivered, Xymon
will run the script <tt>/usr/local/bin/smsalert</tt>. The script can use data from a series of 
environment variables to build the information it sends in the alert, depending on what the 
recipient can handle. E.g. for pagers you will typically just send a sequence of numbers - 
Xymon provides things like the IP-address of the server that has a problem and a numeric code 
for the service to the script. So a simple script to send an SMS alert with the "sendsms" 
tool could look like this:</p>
<pre>
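	#!/bin/sh
	# RCPT and BBALPHAMSG are provided by Xymon in the environment.
	# Both are quoted so that spaces in their values are passed on intact.

	/usr/local/bin/sendsms "$RCPT" "$BBALPHAMSG"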
</pre>
<p>Here you can see the script using two environment variables that Xymon sets up for it:
<em>$RCPT</em> is the recipient, i.e. the phone-number "4538761925" from the alerts.cfg
file, and <em>$BBALPHAMSG</em> is the text of the status that triggers the alert.</p>

<p>Although $BBALPHAMSG is nice to have, not all recipients can handle the large messages that may
be sent in the status message.  The <tt>FORMAT=sms</tt> tells Xymon to change the BBALPHAMSG into
a form that is suitable for an SMS message - which has a maximum size of 160 bytes. So Xymon picks
out the most important bits of the status message, and puts as much of that as possible into the
BBALPHAMSG variable for the script.</p>
<p>The full list of environment variables provided to scripts is as follows:</p>
<table width="80%" align="center" summary="Paging script environment variables">
	<tr><th align="left" valign="top">BBCOLORLEVEL</th><td>The current color of the status</td></tr>
	<tr><th align="left" valign="top">BBALPHAMSG</th><td>The full text of the status log triggering the alert</td></tr>
	<tr><th align="left" valign="top">ACKCODE</th><td>The "cookie" that can be used to acknowledge the alert</td></tr>
	<tr><th align="left" valign="top">RCPT</th><td>The recipient, from the SCRIPT entry</td></tr>
	<tr><th align="left" valign="top">BBHOSTNAME</th><td>The name of the host that the alert is about</td></tr>
	<tr><th align="left" valign="top">MACHIP</th><td>The IP-address of the host that has a problem</td></tr>
	<tr><th align="left" valign="top">BBSVCNAME</th><td>The name of the service that the alert is about</td></tr>
	<tr><th align="left" valign="top">BBSVCNUM</th><td>The numeric code for the service. From SVCCODES definition.</td></tr>
	<tr><th align="left" valign="top">BBHOSTSVC</th><td>HOSTNAME.SERVICE that the alert is about.</td></tr>
	<tr><th align="left" valign="top">BBHOSTSVCCOMMAS</th><td>As BBHOSTSVC, but with the dots in the hostname replaced by commas</td></tr>
	<tr><th align="left" valign="top">BBNUMERIC</th><td>A 22-digit number made by BBSVCNUM, MACHIP and ACKCODE.</td></tr>
	<tr><th align="left" valign="top">RECOVERED</th><td>Is "1" if the service has recovered.</td></tr>
	<tr><th align="left" valign="top">DOWNSECS</th><td>Number of seconds the service has been down.</td></tr>
	<tr><th align="left" valign="top">DOWNSECSMSG</th><td>When recovered, holds the text "Event duration : N" where N is the DOWNSECS value.</td></tr>
</table>
<p>This set of environment variables is the same as the one provided by Big Brother to custom
paging scripts, so you should be able to re-use any paging scripts written for Big Brother
with Xymon.</p>

<h3><a name="macros">Save on the typing - use macros</a></h3>
<p>Say you have a long list of hosts or e-mail addresses that you want to use several times throughout the
alerts.cfg file. Do you have to write the full list every time? No:</p>
<pre>
	$WEBHOSTS=%(www|intranet|support|mail).foo.com 
	
	HOST=$WEBHOSTS SERVICE=http
		SCRIPT /usr/local/bin/smsalert 4538761925 FORMAT=sms

	HOST=$WEBHOSTS SERVICE=cpu,disk,memory
		MAIL unixsupport@foo.com
</pre>
<p>The first line defines <em>$WEBHOSTS</em> as a <em>macro</em>. So everywhere else in the file,
&quot;$WEBHOSTS&quot; is automatically replaced with &quot;&#37;(www|intranet|support|mail).foo.com&quot;
before the rule is processed. The same method can be used for recipients, e.g. e-mail addresses.
In fact, you can put an entire line into a macro:</p>
<pre>
	$UNIXSUPPORT=MAIL unixsupport@foo.com TIME=*:0800:1600 SERVICE=cpu,disk,memory

	HOST=%(www|intranet|support|mail).foo.com 
		$UNIXSUPPORT

	HOST=dns.bar.com
		$UNIXSUPPORT
</pre>
<p>This is a perfectly valid way of specifying that <tt>unixsupport@foo.com</tt> gets 
e-mailed about cpu-, disk- or memory-problems on the foo.com web-servers and the
bar.com dns-server, between 8 AM and 4 PM.</p>

<p>Note: Macros can be nested, but you must define a macro before
you use it in a subsequent macro definition.</p>
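<p>A small sketch of nesting, using only the syntax shown above. The macro name
<tt>$OFFICEHOURS</tt> is made up for this example; it is defined first, and then used
inside the definition of <tt>$UNIXSUPPORT</tt>:</p>
<pre>
	$OFFICEHOURS=TIME=*:0800:1600
	$UNIXSUPPORT=MAIL unixsupport@foo.com $OFFICEHOURS SERVICE=cpu,disk,memory

	HOST=dns.bar.com
		$UNIXSUPPORT
</pre>
<p>If the two definitions were swapped, <tt>$OFFICEHOURS</tt> would not yet be defined when
<tt>$UNIXSUPPORT</tt> is read, which is the case the note above warns about.</p>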

<h3><a name="ignorerules">There are rules ... and exceptions: IGNORE</a></h3>
<p>A common scenario is where you handle most of the alerts with a wildcard rule, but
there is <i>just</i> that one exception where you don't want any cpu alerts
from the marketing server on Thursday afternoon. Then it is time for the 
IGNORE recipient:</p>
<pre>
	HOST=* COLOR=red
		IGNORE HOST=marketing.foo.com SERVICE=cpu TIME=4:1500:1800
		MAIL admin@foo.com
</pre>
<p>This defines a general catch-all alert: All red alerts
go off to the admin@foo.com mailbox. There is just one exception: When
marketing.foo.com alerts on the "cpu" status on Thursdays between 3 PM and
6 PM, that alert is ignored. The IGNORE recipient implicitly has a STOP 
flag associated, so when the IGNORE recipient is matched, Xymon stops
looking for more recipients. The next line with the MAIL recipient is therefore
never looked at when handling that busy marketing server on Thursdays.</p>
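<p>Since an IGNORE recipient stops the search, it only has an effect when it is placed before
the recipients it should shield. As a sketch, several exceptions can be listed ahead of the
catch-all recipient; the host <tt>test.foo.com</tt> is made up for this example:</p>
<pre>
	HOST=* COLOR=red
		IGNORE HOST=marketing.foo.com SERVICE=cpu TIME=4:1500:1800
		IGNORE HOST=test.foo.com
		MAIL admin@foo.com
</pre>
<p>With this layout, red alerts from test.foo.com would be dropped at the second IGNORE line,
while all other red alerts still reach admin@foo.com.</p>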
</body>
</html>

